This document outlines the R code that was used across the lectures for Advanced Statistics course at MSc in Psychological Research Methods programme at the University of Sheffield, UK. The code follows the lectures, but also expands on the content covered by the lecture.

Lecture 1: Linear regression

Introduction

First, we used the dataset that are included in R packages. We can see what data do we have in standard R packages by calling: data() or we can see datasets specific for certain R packages by calling: data(package=‘lme4’).

In this case, we can call pre-loaded dataset cars. More information on this data is available by calling ?cars or help(cars).

data("cars")  #calls the dataset
head(cars)  # prints first 6 observations in the dataset
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

We can use scatter plot to see the raw values.

plot(cars$speed, cars$dist, xlab = "Predictor", ylab = "Outcome")  #As this plot was only used to make an illustration of the linear regression, we changed the labels of x and y-axis. This was done using xlab and ylab parameters. You can change the name of the labels using the same code, eg. xlab='Speed (mph)', ylab='Stopping distance (ft)'
abline(lm(dist ~ speed, data = cars), lwd = 2)  #abline function adds one or more straight lines through the plot that you have active in your environment. lm function is used to fit linear models, where dist is modelled as a function of speed. We also indicate that values for distance and speed can be found in the cars dataset (data = cars). Finally, we specify the thickness of abline function with lwd=2 parameter. 

Data simulation

We can also simulate some data:

set.seed(456)  # we can set starting numbers that are used to generate our random values. Random numbers are rarely truly random: https://engineering.mit.edu/engage/ask-an-engineer/can-a-computer-generate-a-truly-random-number/
Babies = data.frame(Age = round(runif(100, 1, 30)), Weight = rnorm(100, 4000, 500))  #We create a new data frame (Babies) where we have Age and Weight as variables. 100 Age values are sampled from random uniform distribution (runif) with lower bound of 1 (minimum) and upper bound of 30 (maximum). 100 Weight values are generated using random normal distribution (rnorm) with mean of 4000 and SD of 500 
Babies$Height = rnorm(100, 40 + 0.2 * Babies$Age + 0.004 * Babies$Weight, 5)  # Height is generated using random normal distribution where mean is a function of Age and Weight, while SD is 5. 
Babies$Gender = factor(rbinom(100, 1, 0.5))  # 100 Gender values are generated using random binomial function with equal probability of being assigned one or the other sex category
levels(Babies$Gender) = c("Girls", "Boys")  #We levels function to assign Girls and Boys labels to 1 and 0 levels generated by the function

We can plot and inspect raw data:

par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)  # par parameter sets global plotting settings. mfrow indicates number of panels for plots (1 row and 3 columns), bty sets type of box around the plot, mar defines margines around the plot, cex magnifies size of labels, while pch sets type of points used in the plot.
plot(Babies$Age, Babies$Height, xlab = "Age (months)", ylab = "Height (cm)")
plot(Babies$Weight, Babies$Height, xlab = "Weight (grams)", ylab = "Height (cm)")
boxplot(Babies$Height ~ Babies$Gender, xlab = "Gender", ylab = "Height (cm)")

We can also check the coding and formatting of the variables in our data frame.

str(Babies)
## 'data.frame':    100 obs. of  4 variables:
##  $ Age   : num  4 7 22 26 24 11 3 9 8 12 ...
##  $ Weight: num  3668 3872 4339 4448 4309 ...
##  $ Height: num  61.5 56.1 59.3 55.2 60.3 ...
##  $ Gender: Factor w/ 2 levels "Girls","Boys": 1 1 2 2 2 2 1 1 1 1 ...

Finally, we can also load the dataset from our local disc drives. The datasets should be placed in your working environment. There are a number of tutorials that guide you how to optimise creation of the new projects, which results in all of your data, code and files being generated and saved at one place. https://r4ds.had.co.nz/workflow-projects.html

inequality <- read.table("inequality.txt", sep = "\t", header = T)  # read a file in table format and create a data frame from it. sep parameter defines separator in the data - tab delimited format in this case, while it could be also anything else, such as ',' - comma separated values, ' ' - space separated values etc. Header = T defines that the names of the variables are in the first row of the database.

# You can also numerous other extensions, suchs spss sav files, however you
# will often need additional packages for this.  Foreign package -
# library(foreign) has read.spss function that can be used to read in spss
# files dataExample<-read.spss('Example.sav', to.data.frame=T)

Linear regression

One predictor

lm1 <- lm(Height ~ Age, data = Babies)  # linear model where Height is modelled as a function of Age. 
lm1$coefficients  # We can print only the coefficients
## (Intercept)         Age 
##   57.025804    0.143174
summary(lm1$coefficients)  #We can also part of the information that is frequently used to judge significance and importance of predictors  
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1432 14.3638 28.5845 28.5845 42.8051 57.0258

Two predictors

lm2 <- lm(Height ~ Age + Gender, data = Babies)  # linear model where Height is analysed as a function of Age and Gender
lm2$coefficients
## (Intercept)         Age  GenderBoys 
##  57.5038799   0.1403955  -0.8309449

Main effects and interaction (numerical x categorical)

lm3 <- lm(Height ~ Age * Gender, data = Babies)  # linear model where Height is modelled as a function of Age and Gender, as well as their interaction
lm3$coefficients
##    (Intercept)            Age     GenderBoys Age:GenderBoys 
##     56.1448603      0.2202400      1.8315868     -0.1607307

Main effects and interaction (numerical x numerical)

lm4 <- lm(Height ~ Age * Weight, data = Babies)
lm4$coefficients
##   (Intercept)           Age        Weight    Age:Weight 
## 30.7244904854  0.6913148965  0.0066360329 -0.0001397745

Other information that we get from the linear model

lm1 <- lm(Height ~ Age, data = Babies)
summary(lm1)
## 
## Call:
## lm(formula = Height ~ Age, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4765  -4.1601  -0.3703   3.9198  12.3842 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.02580    1.18751  48.021   <2e-16 ***
## Age          0.14317    0.06426   2.228   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.283 on 98 degrees of freedom
## Multiple R-squared:  0.04821,    Adjusted R-squared:  0.0385 
## F-statistic: 4.964 on 1 and 98 DF,  p-value: 0.02817

We can calculate R2 step-by-step. First, we need predictions from the model that includes predictor, as this will give us residual sum of squares.

Babies$lm1 = predict(lm1, newdata = Babies)  # predict Height based on our lm1 model.
Babies$diff = Babies$lm1 - Babies$Height  #calculate differences between predicted (lm1) and observed values (Height) 

We can plot these differences using ggplot.

require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = Height, ymax = lm1,
    colour = diff), size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm1),
    size = 1.2) + geom_point(data = Babies, aes(Age, y = Height), size = 2) + ylab("Height") +
    xlab("Age") + ggtitle("SS_residual")

A second ingredient for our determination coefficient is sum of squares of total variation in the data. This we can get by building model with intercept only.

lm0 <- lm(Height ~ 1, data = Babies)
summary(lm0)
## 
## Call:
## lm(formula = Height ~ 1, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4165  -4.2284  -0.2062   3.6744  13.5940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.3953     0.5388   110.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.388 on 99 degrees of freedom
Babies$lm0 = predict(lm0, newdata = Babies)  #predict Height based on average value only
Babies$diff2 = Babies$lm0 - Babies$Height  #calculate difference between predicted (lm0) and observed values (Height)

We can plot these differences using ggplot.

ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = Height, ymax = lm0,
    colour = diff2), size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm0),
    size = 1.2) + geom_point(data = Babies, aes(Age, y = Height), size = 2) + ylab("Height") +
    xlab("Age") + ggtitle("SS_total")

The R2 coefficient is:

1 - (sum(Babies$diff^2)/(sum(Babies$diff2^2)))
## [1] 0.04821098

Improvement in our prediction:

Babies$diff3 = Babies$lm1 - Babies$lm0  #Improvement based on the inclusion of Age as a predictor - differences between the predicted values from (lm1) and intercept only model (lm0)
ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = lm1, ymax = lm0, colour = diff3),
    size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm0), size = 1.2) + geom_line(data = Babies,
    aes(Age, y = lm1), size = 1.3, linetype = "dotdash") + geom_point(data = Babies,
    aes(x = Age, y = Height), size = 0.9) + ylab("Height") + xlab("Age") + ggtitle("Improvement") +
    theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12))

F-value:

(sum(Babies$diff3^2)/1)/(sum(Babies$diff^2)/98)
## [1] 4.963995

Practical aspect of the lecture

inequality <- read.table("inequality.txt", sep = ",", header = T)  #Reading the data in R
head(inequality)
##        state median_household_income share_unemployed_seasonal
## 1    Alabama                   42278                     0.060
## 2     Alaska                   67629                     0.064
## 3    Arizona                   49254                     0.063
## 4   Arkansas                   44922                     0.052
## 5 California                   60487                     0.059
## 6   Colorado                   60940                     0.040
##   share_population_in_metro_areas share_population_with_high_school_degree
## 1                            0.64                                    0.821
## 2                            0.63                                    0.914
## 3                            0.90                                    0.842
## 4                            0.69                                    0.824
## 5                            0.97                                    0.806
## 6                            0.80                                    0.893
##   share_non_citizen share_white_poverty gini_index share_non_white
## 1              0.02                0.12      0.472            0.35
## 2              0.04                0.06      0.422            0.42
## 3              0.10                0.09      0.455            0.49
## 4              0.04                0.12      0.458            0.26
## 5              0.13                0.09      0.471            0.61
## 6              0.06                0.07      0.457            0.31
##   share_voters_voted_trump hate_crimes_per_100k_splc
## 1                     0.63                0.12583893
## 2                     0.53                0.14374012
## 3                     0.50                0.22531995
## 4                     0.60                0.06906077
## 5                     0.33                0.25580536
## 6                     0.44                0.39052330
##   avg_hatecrimes_per_100k_fbi
## 1                   1.8064105
## 2                   1.6567001
## 3                   3.4139280
## 4                   0.8692089
## 5                   2.3979859
## 6                   2.8046888

Probability density plots: outcomes

par(mfrow = c(1, 2))
plot(density(inequality$hate_crimes_per_100k_splc, na.rm = T), main = "Crimes per 100k")  #probability density for hate crimes outcomes. na.rm=T indicates that only cases that are not NAs should be returned
plot(density(inequality$avg_hatecrimes_per_100k_fbi, na.rm = T), main = "Average Crimes")

Probability density plots: predictors

par(mfrow = c(1, 2))
plot(density(inequality$median_household_income, na.rm = T), main = "Income")
plot(density(inequality$gini_index, na.rm = T), main = "Gini")

Scatter plots:

par(mfrow = c(1, 2))
plot(inequality$median_household_income, inequality$avg_hatecrimes_per_100k_fbi,
    xlab = "Median household income", ylab = "Avg hatecrimes")
plot(inequality$gini_index, inequality$avg_hatecrimes_per_100k_fbi, xlab = "Gini index",
    ylab = "Avg hatecrimes")

cor(inequality[, c(2, 8, 12)], use = "complete.obs")  #calculate correlations where we use only rows that have all values (no NAs). We are only focusing on columns: 2,8 and 12. inequality is a data frame with (n rows, m columns), so we can access only first row by calling inequality[1,], or just first column by inequality[,1], first row and first column would be inequality[1,1]. When using inequality[,c(2,8,12)] am asking for all rows but only second, eight and twelfth column.
##                             median_household_income gini_index
## median_household_income                   1.0000000 -0.1497444
## gini_index                               -0.1497444  1.0000000
## avg_hatecrimes_per_100k_fbi               0.3182464  0.4212719
##                             avg_hatecrimes_per_100k_fbi
## median_household_income                       0.3182464
## gini_index                                    0.4212719
## avg_hatecrimes_per_100k_fbi                   1.0000000

Stepwise approach in building linear model

mod1 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income, data = inequality)  #modelling avg hatecrimes as a function of median_household_income
summary(mod1)
## 
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income, 
##     data = inequality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3300 -1.1183 -0.1656  0.7827  7.7762 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             -9.564e-01  1.448e+00  -0.661   0.5121  
## median_household_income  6.054e-05  2.603e-05   2.326   0.0243 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.642 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1013, Adjusted R-squared:  0.08256 
## F-statistic: 5.409 on 1 and 48 DF,  p-value: 0.0243

Adding a new predictor and comparing it with the previous model.

mod2 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index, data = inequality)
anova(mod2)  #anova type comparison of the model that allows us to see whether newly added predictor explained additional variation in the outcome. We can see changes in the Sum Sq for residuals and for the main effects
## Analysis of Variance Table
## 
## Response: avg_hatecrimes_per_100k_fbi
##                         Df Sum Sq Mean Sq F value    Pr(>F)    
## median_household_income  1 14.584  14.584  7.0649 0.0107093 *  
## gini_index               1 32.389  32.389 15.6906 0.0002517 ***
## Residuals               47 97.020   2.064                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also test whether there is need to adjust influence of median household income across the values of gini index - interaction between our predictors

mod3 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index, data = inequality)
anova(mod1, mod2, mod3)  # comparison between three models
## Analysis of Variance Table
## 
## Model 1: avg_hatecrimes_per_100k_fbi ~ median_household_income
## Model 2: avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index
## Model 3: avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     48 129.41                                  
## 2     47  97.02  1    32.389 19.742 5.539e-05 ***
## 3     46  75.47  1    21.550 13.135 0.0007218 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary of the model:

summary(mod3)
## 
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income * 
##     gini_index, data = inequality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.16383 -0.96279 -0.01053  0.88008  2.75763 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         9.056e+01  3.070e+01   2.950 0.004986 ** 
## median_household_income            -1.724e-03  4.965e-04  -3.472 0.001136 ** 
## gini_index                         -2.001e+02  6.666e+01  -3.002 0.004330 ** 
## median_household_income:gini_index  3.907e-03  1.078e-03   3.624 0.000722 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.281 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4759, Adjusted R-squared:  0.4417 
## F-statistic: 13.92 on 3 and 46 DF,  p-value: 1.367e-06

We can visualise the interactions using interactions package.

require(interactions)
## Loading required package: interactions
## Warning: package 'interactions' was built under R version 4.1.1
interact_plot(mod3, pred = median_household_income, modx = gini_index, plot.points = T)

Interactive visualisation:

simulatedD <- data.frame(median_household_income = rep(seq(35500, 76165, by = 100),
    13), gini_index = rep(seq(0.41, 0.53, by = 0.01), each = 407))
simulatedD$Avg_hate_pred <- predict(mod3, newdata = simulatedD)  #We need to simulate full matrix of observations - each combination of Gini index and Median Income (increments of 0.1 and 100, respectivelly). Then we predict observations based on our model and use the Simulated data as our predictors. 
p <- ggplot(simulatedD, aes(median_household_income, Avg_hate_pred, color = gini_index,
    frame = gini_index)) + geom_point()  # Using ggplot to make the plot
plotly::ggplotly(p)  #make it interactive using plotly

Model diagnostics

Quantile-quantile plot: Normality

car::qqPlot(mod3)

## [1]  9 18

Homoscedasticity: Linearity

car::residualPlot(mod3, type = "rstudent")  # we can call car package without loading it into R environment by calling car::

Outliers:

car::influenceIndexPlot(mod3)  #influence of individual observation on our model

Autocorrelation of residuals

par(bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
stats::acf(resid(mod3))

Predicted versus observed data:

par(bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(predict(mod3), mod3$model$avg_hatecrimes_per_100k_fbi, xlab = "Predicted values (model 3)",
    ylab = "Observed values (avg hatecrimes)")  #The x-axis is predict(mod3) - predict values based on our model. The y-axis is the data that was used to build the model. I decided to take these values from our model frame (mod3), instead of original data frame (inequalities). 

Finally, we can subset our model and exclude data point that might skew our results

mod3.cor <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index,
    data = inequality, subset = -9)  #We subset our data to exclude row 9
summary(mod3.cor)
## 
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income * 
##     gini_index, data = inequality, subset = -9)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.90964 -0.94966 -0.08526  0.73470  2.55257 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         4.429e+01  3.080e+01   1.438    0.157
## median_household_income            -7.992e-04  5.228e-04  -1.529    0.133
## gini_index                         -9.668e+01  6.725e+01  -1.438    0.157
## median_household_income:gini_index  1.837e-03  1.145e-03   1.604    0.116
## 
## Residual standard error: 1.154 on 45 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1285, Adjusted R-squared:  0.07041 
## F-statistic: 2.212 on 3 and 45 DF,  p-value: 0.09974

Lecture 2: Logistic regression

Data simulation

We can also simulate categorical outcomes.

set.seed(456)
Babies = data.frame(Age = round(runif(100, 1, 30)), Weight = rnorm(100, 4000, 500))
Babies$Height = rnorm(100, 40 + 0.2 * Babies$Age + 0.004 * Babies$Weight, 5)
Babies$Gender = rbinom(100, 1, 0.5)
Babies$Crawl = rbinom(100, 1, 0.031 * Babies$Age + 1e-05 * Babies$Weight - 0.06 *
    Babies$Gender)  #I simulated Crawling data from random binomial distribution, where I took out 100 times 1 sample. Probability of success was defined in relation to Babies Age, Weigh and Gender. 
Babies$Gender = as.factor(Babies$Gender)  # I recode these numbers to factor
levels(Babies$Gender) = c("Girls", "Boys")  # Assigning labels to Gender factor
table(Babies$Crawl)
## 
##  0  1 
## 37 63

Plotting probability density function for Number of babies crawling in our data

par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(1:100, dbinom(1:100, 0.63, size = 100), xlab = "N of babies crawling", ylab = "Probability",
    type = "l")

Relation between log odds, odds, and probabilities

require(ggplot2)  #load in ggplot2
logit <- data.frame(LogOdds = seq(-2.5, 2.5, by = 0.1), Pred = seq(-2.5, 2.5, by = 0.1))  #create data frame where variable LogOdds containts numbers from -2.5 to 2.5 changing by 0.1 
logit$Odds = exp(logit$LogOdds)  #exponentiate logits that results in odds
logit$Probabilities = logit$Odds/(1 + logit$Odds)  #transform odds ratios to probabilities 

ggplot(data = logit, aes(x = Pred, y = Odds)) + geom_point(size = 2) + theme_bw() +
    ylim(0, 13) + theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12))  #plotting odds

Plotting Log-Odds:

ggplot(data = logit, aes(x = Pred, y = LogOdds)) + geom_point(size = 2) + theme_bw() +
    ylim(-4, 4) + theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12))

Plotting probabilities:

ggplot(data = logit, aes(x = Pred, y = Probabilities)) + geom_point(size = 2) + theme_bw() +
    ylim(0, 1) + theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12))

We can again check what is in our simulated data:

par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(Babies$Age, Babies$Height, xlab = "Age (months)", ylab = "Height (cm)")
boxplot(Babies$Height ~ Babies$Gender, xlab = "Gender", ylab = "Height (cm)")
boxplot(Babies$Age ~ Babies$Crawl, xlab = "Crawl", ylab = "Age (months)")  #Boxplot of Babies Age across our Crawling outcome, xlab - label of x-axis, ylab - label of y-axis

Logistic regression model

Let’s build first logistic regression model:

glm1 <- glm(Crawl ~ Age, data = Babies, family = binomial(link = "logit"))  #glm function is used to fit generalized linear models (try typing in your console: ?glm). Crawl is modelled as a function of Age and we are using Babies data. Family specified type of the outcome distribution. We are saying that this our oucome follows Binomial distribution, while we want to use logit link - logOdds transformation. 
glm1$coefficients  #we are printing only coefficients
## (Intercept)         Age 
##  -1.3307752   0.1194777

Let’s get Odds:

glm1$coefficients
## (Intercept)         Age 
##  -1.3307752   0.1194777
exp(glm1$coefficients)  #we can use exponential function to transform our coefficients to Odds ratios - how more likely it is that babies are going to start crawling if they are older by 1 month (beta coefficient - slope)
## (Intercept)         Age 
##   0.2642723   1.1269081

Let’s get probabilities:

1/(1 + exp(1.33078))  # only intercept
## [1] 0.2090304
1/(1 + exp(1.33078 - 0.11948 * 10))  #What is the probability of babies starting to crawl when they are 10 months?
## [1] 0.4660573
arm::invlogit(coef(glm1)[[1]] + coef(glm1)[[2]] * mean(Babies$Age))  # what is the probability of babies starting to crawl when they are mean of their age (around 16 months). I used invlogit function to automatically calculate probabilities. coef(glm1)[[1]] - gives me intercept value, while coef(glm1)[[2]] - gives me slope value for age.
## [1] 0.6562395

We can plot inner workings of our model: Logit, odds and probabilities

Babies$LogOdds = -1.33078 + 0.11948 * Babies$Age  #Outputing logit values based on our model
Babies$Odds = exp(Babies$LogOdds)  #Transforming them to odds 
Babies$Probs = Babies$Odds/(1 + Babies$Odds)  # Transforming odds to probabilities
par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(Babies$Age, Babies$LogOdds)
plot(Babies$Age, Babies$Odds)
plot(Babies$Age, Babies$Probs)

Lets see what goes into the model and how our model sees the data:

ggplot(data = Babies, aes(x = Age, y = Probs)) + geom_point(size = 2, col = "blue") +
    geom_point(data = Babies, aes(x = Age, y = Crawl), size = 2, col = "red")  #Red points show our dependent outcome - 0,1; blue points show estimated probabilities across the values of Age. 

Summary of the results

summary(glm1)  #Summary of the complete model
## 
## Call:
## glm(formula = Crawl ~ Age, family = binomial(link = "logit"), 
##     data = Babies)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0687  -0.9876   0.5443   0.8979   1.6082  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.33078    0.50868  -2.616 0.008894 ** 
## Age          0.11948    0.03079   3.880 0.000104 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 131.79  on 99  degrees of freedom
## Residual deviance: 113.35  on 98  degrees of freedom
## AIC: 117.35
## 
## Number of Fisher Scoring iterations: 3

We can calculate whether our proposed model is improvement in comparison to the null (only intercept model) - statistically significant:

with(glm1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))  #I am asking R to take values from my glm1 object (our model), where I am asking him to take p-values from chi-square distribution where my values are difference between null deviance model and our my fitted model, differences in the degrees of freedom, and we are looking at the right side of the probability distribution with lower.tail=FALSE
## [1] 1.751679e-05

Practical aspect:

basketball <- read.table("Basketball.txt", sep = "\t", header = T)  #Loading the data in R. Data file is .txt file, where separator is tab delimited (tab delimited values) and the names of my variables are in the first row (header=T)
knitr::kable(head(basketball[, c(5, 13, 18, 31, 34, 43)]), format = "html")  #Printing only certain columns (numbers in c())
Home X3PointShots FreeThrows. Opp.3PointShots. Opp.FreeThrows. Win
1 Away 13 0.529 0.308 0.818 0
32 Away 6 0.867 0.323 0.750 1
83 Home 8 0.652 0.368 0.769 1
112 Home 12 0.684 0.481 0.941 0
165 Away 7 0.769 0.364 0.652 0
196 Away 7 0.611 0.500 0.650 1

Let’s plot the data to see how it looks like:

table(basketball$Win)  #Tabulate outcome
## 
##  0  1 
## 69 81
par(mfrow = c(1, 2), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(density(basketball$X3PointShots.), main = "", xlab = "3POintsShots")
plot(density(basketball$Opp.3PointShots.), main = "", xlab = "Opp3PointsShots")

We can also cross-tabulate the data. Focus on combinations of categories:

knitr::kable(table(basketball$Win, basketball$Home), format = "html")
Away Home
0 43 26
1 28 53
datA = aggregate(cbind(basketball$X3PointShots., basketball$Opp.3PointShots., basketball$FreeThrows.,
    basketball$Opp.FreeThrows.), list(basketball$Win), mean)  #aggregate function aggregates our data. I used this function to calculate arithmetic mean for X3POintsShots, Opp3Points shots, FreeThrows, OppFreeThrows for each outcome value (0,1). cbind is used to join all numerical values together in one dataframe. In other words, we are not aggregating one by one numerical variable, as cbind allows us to do that jointly. Instead of mean, you can use sd or min or max or whatever you want.  
names(datA) = c("Win", "X3POints_mean", "Opp3POints_mean", "FreeThrows_mean", "OppFreeThrows_mean")  #as my aggregate returns new data frame without the names of the variables I just quickly attach the names to them.
knitr::kable(datA, format = "html")  #Writting it out
Win X3POints_mean Opp3POints_mean FreeThrows_mean OppFreeThrows_mean
0 0.3200580 0.3914928 0.7498696 0.7724928
1 0.3840494 0.3252593 0.7752099 0.7541975

Plotting predictors and categorical outcome

par(mfrow = c(1, 2), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(basketball$X3PointShots., basketball$Win)
plot(basketball$X3PointShots., jitter(basketball$Win, 0.5))

Let’s make first model (one predictor):

baskWL1 <- glm(Win ~ Home, data = basketball, family = binomial("logit"))  #Main effect of Home
summary(baskWL1)
## 
## Call:
## glm(formula = Win ~ Home, family = binomial("logit"), data = basketball)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4909  -1.0015   0.8935   0.8935   1.3642  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.4290     0.2428  -1.767 0.077296 .  
## HomeHome      1.1412     0.3410   3.346 0.000819 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 206.98  on 149  degrees of freedom
## Residual deviance: 195.33  on 148  degrees of freedom
## AIC: 199.33
## 
## Number of Fisher Scoring iterations: 4

Model with two predictors:

baskWL2 <- glm(Win ~ Home + X3PointShots., data = basketball, family = binomial("logit"))  #Main effect of home and X3PointsShots.
summary(baskWL2)
## 
## Call:
## glm(formula = Win ~ Home + X3PointShots., family = binomial("logit"), 
##     data = basketball)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9950  -1.0185   0.5857   0.9993   1.8696  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.5847     0.6977  -3.705 0.000212 ***
## HomeHome        1.1151     0.3568   3.125 0.001779 ** 
## X3PointShots.   6.1575     1.8229   3.378 0.000731 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 206.98  on 149  degrees of freedom
## Residual deviance: 182.56  on 147  degrees of freedom
## AIC: 188.56
## 
## Number of Fisher Scoring iterations: 4

Model comparison (model 2 versus model 1)

anova(baskWL1, baskWL2, test = "LR")  #compare two models where we use likelihood ratio test - for generalized linear models  
## Analysis of Deviance Table
## 
## Model 1: Win ~ Home
## Model 2: Win ~ Home + X3PointShots.
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       148     195.34                          
## 2       147     182.56  1   12.773 0.0003516 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Three predictors:

baskWL3 <- glm(Win ~ Home + X3PointShots. + Opp.3PointShots., data = basketball,
    family = binomial("logit"))  #Main effect of Home, X3PointsShots. and Opp.3PointsShots.
anova(baskWL1, baskWL2, baskWL3, test = "LR")  #comparing three models
## Analysis of Deviance Table
## 
## Model 1: Win ~ Home
## Model 2: Win ~ Home + X3PointShots.
## Model 3: Win ~ Home + X3PointShots. + Opp.3PointShots.
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       148     195.34                          
## 2       147     182.56  1   12.773 0.0003516 ***
## 3       146     166.93  1   15.635 7.683e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interactions:

baskWL4 <- glm(Win ~ Home * X3PointShots. + Opp.3PointShots., data = basketball,
    family = binomial("logit"))  #Three main effects and interaction between Home and X3PointsShots
anova(baskWL3, baskWL4, test = "LR")  #comparing third and fourth model
## Analysis of Deviance Table
## 
## Model 1: Win ~ Home + X3PointShots. + Opp.3PointShots.
## Model 2: Win ~ Home * X3PointShots. + Opp.3PointShots.
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       146     166.93                     
## 2       145     166.85  1 0.078666   0.7791

Visualising our model with two predictors:

basketball$Prob_mod2 = predict(baskWL2, type = "response")
ggplot(data = basketball, aes(x = X3PointShots., y = Prob_mod2, colour = Home)) +
    geom_point(size = 2) + geom_point(data = basketball, aes(x = X3PointShots., y = Win),
    size = 2, col = "blue")

Visualising our model with three predictors

basketball$Prob_mod3 = predict(baskWL3, type = "response")
ggplot(data = basketball, aes(x = X3PointShots., y = Prob_mod3, colour = Home)) +
    geom_point(size = 2) + geom_point(data = basketball, aes(x = X3PointShots., y = Win),
    size = 2, col = "blue")

It is quite nice to report confidence intervals in your work:

confint(baskWL3)  #confidence intervals for logit estimates in our third model
## Waiting for profiling to be done...
##                        2.5 %    97.5 %
## (Intercept)       -2.0130842  1.684825
## HomeHome           0.3598188  1.847523
## X3PointShots.      3.0259724 10.586459
## Opp.3PointShots. -11.2785830 -3.544063
exp(confint(baskWL3))  #confidence intervals for odds ratios in our third model
## Waiting for profiling to be done...
##                         2.5 %       97.5 %
## (Intercept)      1.335761e-01 5.391508e+00
## HomeHome         1.433070e+00 6.344088e+00
## X3PointShots.    2.061404e+01 3.959502e+04
## Opp.3PointShots. 1.264077e-05 2.889568e-02

Let’s see how accurate is our model:

Ctabs <- table(basketball$Win, basketball$Prob_mod3 > 0.5)
Ctabs
##    
##     FALSE TRUE
##   0    47   22
##   1    19   62

Lecture 3: Mixed-effect models

Theoretical part

Simulation of the data with hierarchical categories

set.seed(456)
# Fixed effects
alpha_0 <- 500  #intercept
beta_1 <- 50  #slope 
sigma <- 100  #sd

# by-intercept sd, by_slope sd and correlation between intercept and slope sd
tau_0 <- 30  # by-group random intercet (countries)

tau_1 <- 30  # by-group random slope (countries)
rho <- 0

n_babies <- 10

n_rfx <- faux::rnorm_multi(n = n_babies, mu = 0, sd = c(tau_0, tau_1), r = rho, varnames = c("T_0s",
    "T_1s"))

babies_rfx = data.frame(T_0s = rep(n_rfx$T_0s, each = 200), T_1s = rep(n_rfx$T_1s,
    each = 200))

Babies <- data.frame(Babies_id = rep(1:10, each = 200), babies_rfx)

Babies$Age = round(runif(2000, 1, 30))
Babies$Surounding = rnorm(2000, Babies$T_0s, 20)
Babies$Weight = rnorm(2000, 20, 10)
Babies$CalorieIntake = alpha_0 + Babies$T_0s + (beta_1 + Babies$T_1s) * Babies$Age +
    4 * Babies$Surounding + 5 * Babies$Weight + rnorm(2000, 0, sigma)
table(Babies$Babies_id)  #How many observations we have for each group
## 
##   1   2   3   4   5   6   7   8   9  10 
## 200 200 200 200 200 200 200 200 200 200

Complete pooling model

Linear model where we ignore information about the countries:

mod1CP <- lm(CalorieIntake ~ Age, data = Babies)
summary(mod1CP)
## 
## Call:
## lm(formula = CalorieIntake ~ Age, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1544.49  -226.93    79.93   295.13  1035.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  470.846     20.356   23.13   <2e-16 ***
## Age           49.603      1.143   43.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 428.1 on 1998 degrees of freedom
## Multiple R-squared:  0.485,  Adjusted R-squared:  0.4848 
## F-statistic:  1882 on 1 and 1998 DF,  p-value: < 2.2e-16

Looking at residuals:

par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(resid(mod1CP)[1:200], ylab = "Residuals", xlab = "Index")

No pooling model

Linear model where we estimates parameters for each country separately:

mod1NP <- lm(CalorieIntake ~ Age + factor(Babies_id) - 1, data = Babies)
summary(mod1NP)
## 
## Call:
## lm(formula = CalorieIntake ~ Age + factor(Babies_id) - 1, data = Babies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -809.29 -160.35   -6.41  162.10  855.07 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## Age                   50.2790     0.6783  74.124  < 2e-16 ***
## factor(Babies_id)1   102.5872    20.9058   4.907 9.99e-07 ***
## factor(Babies_id)2   684.2652    20.6108  33.199  < 2e-16 ***
## factor(Babies_id)3   862.3240    21.2219  40.634  < 2e-16 ***
## factor(Babies_id)4  -305.3031    20.8831 -14.620  < 2e-16 ***
## factor(Babies_id)5   487.4414    20.9251  23.295  < 2e-16 ***
## factor(Babies_id)6   132.7962    21.0883   6.297 3.72e-10 ***
## factor(Babies_id)7   659.7740    20.8901  31.583  < 2e-16 ***
## factor(Babies_id)8   656.3745    20.7328  31.659  < 2e-16 ***
## factor(Babies_id)9   679.1863    20.4171  33.266  < 2e-16 ***
## factor(Babies_id)10  642.8545    20.7242  31.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 253.2 on 1989 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9666 
## F-statistic:  5259 on 11 and 1989 DF,  p-value: < 2.2e-16

Looking at residuals:

par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(resid(mod1CP)[1:200], ylab = "Residuals", xlab = "Index")

Multilevel model: intercept

# install.packages('lme4')
require(lme4)

mult1 <- lmer(CalorieIntake ~ Age + (1 | Babies_id), data = Babies)
summary(mult1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: CalorieIntake ~ Age + (1 | Babies_id)
##    Data: Babies
## 
## REML criterion at convergence: 27858.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.1994 -0.6327 -0.0245  0.6410  3.3694 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  Babies_id (Intercept) 132275   363.7   
##  Residual               64118   253.2   
## Number of obs: 2000, groups:  Babies_id, 10
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 460.2558   115.6422    3.98
## Age          50.2773     0.6783   74.12
## 
## Correlation of Fixed Effects:
##     (Intr)
## Age -0.092

Multilevel model: intercept and slope adjustments

mult2 <- lmer(CalorieIntake ~ Age + (1 + Age | Babies_id), data = Babies)
summary(mult2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: CalorieIntake ~ Age + (1 + Age | Babies_id)
##    Data: Babies
## 
## REML criterion at convergence: 25474.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7037 -0.6672  0.0194  0.6576  4.0965 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev. Corr 
##  Babies_id (Intercept) 35937.8  189.57        
##            Age           701.9   26.49   -0.50
##  Residual              18938.3  137.62        
## Number of obs: 2000, groups:  Babies_id, 10
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  469.579     60.307   7.786
## Age           50.048      8.386   5.968
## 
## Correlation of Fixed Effects:
##     (Intr)
## Age -0.497

Comparison of residuals

par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 1, 0.1), cex = 1.1, pch = 16)
plot(resid(mod1CP)[1:200], ylab = "Residuals", xlab = "Index", main = "Complete pooling")
plot(resid(mod1NP)[1:200], ylab = "Residuals", xlab = "Index", main = "No pooling")
plot(resid(mult2)[1:200], ylab = "Residuals", xlab = "Index", main = "Partial pooling")

Fixed and random effects

fixef(mult2)
## (Intercept)         Age 
##   469.57946    50.04775
ranef(mult2)
## $Babies_id
##    (Intercept)        Age
## 1    273.96266 -40.083989
## 2    -75.56727  19.518621
## 3     15.73010  22.659262
## 4   -130.81904 -40.356098
## 5    362.47248 -21.416831
## 6   -129.73763 -12.337377
## 7   -105.14316  18.855336
## 8     94.77461   6.176257
## 9   -209.91822  29.259636
## 10   -95.75452  17.725183
## 
## with conditional variances for "Babies_id"

Intra-class correlation (ICC)

mod1 <- lmer(CalorieIntake ~ Age + (1 | Babies_id), data = Babies)
performance::icc(mod1)
## # Intraclass Correlation Coefficient
## 
##     Adjusted ICC: 0.674
##   Unadjusted ICC: 0.354

Significance testing: fixed effects

# install.packages('lmerTest')
require(lmerTest)
mult2 <- lmer(CalorieIntake ~ Age + (1 + Age | Babies_id), data = Babies)
print(summary(mult2), cor = F)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: CalorieIntake ~ Age + (1 + Age | Babies_id)
##    Data: Babies
## 
## REML criterion at convergence: 25474.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7037 -0.6672  0.0194  0.6576  4.0965 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev. Corr 
##  Babies_id (Intercept) 35937.8  189.57        
##            Age           701.9   26.49   -0.50
##  Residual              18938.3  137.62        
## Number of obs: 2000, groups:  Babies_id, 10
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  469.579     60.307   8.995   7.786 2.75e-05 ***
## Age           50.048      8.386   9.001   5.968 0.000211 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Practical aspect

# install.packages('foreign')
require(foreign)  #to read-in the data that are SPSS format (.sav) we need foreign package
mwell <- read.spss("data.sav", to.data.frame = T)  #read.spss function and we specify that we should read it as a data frame
dim(mwell)  #dimensions of our data (number of rows and columns)
## [1] 120115     74
mwell$Total = mwell$Watchwe_adj + mwell$Watchwk_adj + mwell$Comphwe_adj + mwell$Comphwk_adj +
    mwell$Smartwe_adj + mwell$Smartwk_adj  #total amount of time spent watching screen in a week

Number of missing values in our outcome:

table(is.na(mwell$mwb))
## 
##  FALSE   TRUE 
## 102580  17535

Numbe of missing values in our predictor:

table(is.na(mwell$Watchwk))
## 
##  FALSE   TRUE 
## 117638   2477

Density plots for our variables

par(mfrow = c(1, 2), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(density(mwell$mwb, na.rm = TRUE), main = "")
plot(density(mwell$Total, na.rm = T), main = "")

Subsetting the data - excluding NAs:

mwell2 = mwell[!is.na(mwell$mwb) & !is.na(mwell$Total), ]  # we are trying to subset our data frame and take only values that are not NAs. is.na is a boolean function that tells us whether one row is NA or not (TRUE or FALSE). !is.na indicates that we do not want TRUE na.values. We also have a logical parameter & that combines two conditions !is.na(outcome) & !is.na(predictor). Finally, we would like to filter our dataset by row and exclude all the rows that have either TRUE value in our predictor or our outcome. Therefore, we are looking at rows mwell[function goes here,] instead of mwell[,function goes here ] which would look at columns 
dim(mwell2)  #dimensions of the smaller data
## [1] 98853    75

Scatter plot:

cor(mwell2$mwb, mwell2$Total)
## [1] -0.1724534
par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(mwell2$Total[1:500], mwell2$mwb[1:500])

Number of observations for each of our potential random structures:

table(mwell2$Ethnicg)
## 
##            White            Mixed            Asian            Black 
##            76428             3827             9537             4413 
## Other or unknown 
##             4648
table(mwell2$REGION)
## 
## East Midlands                                                            
##                                                                     6440 
## East of England                                                          
##                                                                     7612 
## London                                                                   
##                                                                    18524 
## North East                                                               
##                                                                     6826 
## North West                                                               
##                                                                    15276 
## South East                                                               
##                                                                    13160 
## South West                                                               
##                                                                    10105 
## West Midlands                                                            
##                                                                    10328 
## Yorkshire and the Humber                                                 
##                                                                    10582

Building the model:

MWmod1 <- lmer(mwb ~ (1 | LANAME), data = mwell2)  #random effect of Local area
MWmod2 <- lmer(mwb ~ (1 | LANAME) + (1 | Ethnicg), data = mwell2)  #crossed random effects of local area and ethnicity
MWmod3 <- lmer(mwb ~ (1 | REGION/LANAME) + (1 | Ethnicg), data = mwell2)  #nested random effect of local area that is nested in region and crossed with ethnicity
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00241727 (tol = 0.002, component 1)
anova(MWmod1, MWmod2, MWmod3)  #comparison of the models
## refitting model(s) with ML (instead of REML)
## Data: mwell2
## Models:
## MWmod1: mwb ~ (1 | LANAME)
## MWmod2: mwb ~ (1 | LANAME) + (1 | Ethnicg)
## MWmod3: mwb ~ (1 | REGION/LANAME) + (1 | Ethnicg)
##        npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)   
## MWmod1    3 726331 726359 -363162   726325                        
## MWmod2    4 726326 726364 -363159   726318 7.1022  1   0.007699 **
## MWmod3    5 726323 726371 -363157   726313 4.2095  1   0.040198 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Building the model: Fixed structure 1

MWmod2a <- lmer(mwb ~ Total + (1 | LANAME) + (1 | Ethnicg), data = mwell2)  #main effect of total
print(summary(MWmod2a), cor = F)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: mwb ~ Total + (1 | LANAME) + (1 | Ethnicg)
##    Data: mwell2
## 
## REML criterion at convergence: 723288.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.0004 -0.6215  0.0686  0.6822  3.0084 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  LANAME   (Intercept)  0.2160  0.4647  
##  Ethnicg  (Intercept)  0.2936  0.5418  
##  Residual             87.9915  9.3804  
## Number of obs: 98853, groups:  LANAME, 150; Ethnicg, 5
## 
## Fixed effects:
##               Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)  5.109e+01  2.594e-01  4.583e+00  196.96 3.53e-10 ***
## Total       -2.054e-01  3.695e-03  9.801e+04  -55.59  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Building the model: Fixed structure 2

MWmod2b <- lmer(mwb ~ Total + male + (1 | LANAME) + (1 | Ethnicg), data = mwell2)  #main effect of total and main effect of sex
MWmod2c <- lmer(mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg), data = mwell2)  #interaction between total and sex
anova(MWmod2a, MWmod2b, MWmod2c)  #comparison of the models 
## refitting model(s) with ML (instead of REML)
## Data: mwell2
## Models:
## MWmod2a: mwb ~ Total + (1 | LANAME) + (1 | Ethnicg)
## MWmod2b: mwb ~ Total + male + (1 | LANAME) + (1 | Ethnicg)
## MWmod2c: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg)
##         npar    AIC    BIC  logLik deviance   Chisq Df Pr(>Chisq)    
## MWmod2a    5 723288 723335 -361639   723278                          
## MWmod2b    6 717197 717254 -358592   717185 6093.02  1  < 2.2e-16 ***
## MWmod2c    7 716978 717044 -358482   716964  221.25  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(summary(MWmod2c), cor = F)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg)
##    Data: mwell2
## 
## REML criterion at convergence: 716985.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.1959 -0.6181  0.0586  0.6695  3.3167 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  LANAME   (Intercept)  0.1892  0.4349  
##  Ethnicg  (Intercept)  0.3367  0.5802  
##  Residual             82.5528  9.0859  
## Number of obs: 98853, groups:  LANAME, 150; Ethnicg, 5
## 
## Fixed effects:
##                 Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)    4.884e+01  2.835e-01  5.101e+00  172.26 8.42e-11 ***
## Total         -1.979e-01  4.928e-03  9.864e+04  -40.17  < 2e-16 ***
## maleYes        2.921e+00  1.328e-01  9.881e+04   21.99  < 2e-16 ***
## Total:maleYes  1.084e-01  7.285e-03  9.880e+04   14.88  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random structure: random slopes

MWmod3c <- lmer(mwb ~ Total * male + (1 + Total | LANAME) + (1 | Ethnicg), data = mwell2)  #random slopes for total predictor for Local area and intercept for local area 
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
## Warning: Model failed to converge with 1 negative eigenvalue: -8.9e+00
MWmod3c <- lmer(mwb ~ Total * male + (1 | LANAME:male) + (1 | Ethnicg), data = mwell2)  #random intercept for unique combination between local area and sex 
anova(MWmod2c, MWmod3c)  #comparison of the models
## refitting model(s) with ML (instead of REML)
## Data: mwell2
## Models:
## MWmod2c: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg)
## MWmod3c: mwb ~ Total * male + (1 | LANAME:male) + (1 | Ethnicg)
##         npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## MWmod2c    7 716978 717044 -358482   716964                    
## MWmod3c    7 716995 717062 -358491   716981     0  0
MWmod3c <- lmer(mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg:male), data = mwell2)  #random intercepts for unique combination between Ethnicity and sex
anova(MWmod2c, MWmod3c)
## refitting model(s) with ML (instead of REML)
## Data: mwell2
## Models:
## MWmod2c: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg)
## MWmod3c: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg:male)
##         npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## MWmod2c    7 716978 717044 -358482   716964                    
## MWmod3c    7 716955 717022 -358471   716941 22.41  0
summary(MWmod3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: mwb ~ (1 | REGION/LANAME) + (1 | Ethnicg)
##    Data: mwell2
## 
## REML criterion at convergence: 726315.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6003 -0.6199  0.0771  0.6881  2.4872 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev.
##  LANAME:REGION (Intercept)  0.18076 0.4252  
##  REGION        (Intercept)  0.03315 0.1821  
##  Ethnicg       (Intercept)  0.08082 0.2843  
##  Residual                  90.74939 9.5262  
## Number of obs: 98853, groups:  LANAME:REGION, 150; REGION, 9; Ethnicg, 5
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  47.5569     0.1568  5.2742   303.3 2.16e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00241727 (tol = 0.002, component 1)

Visualisation of the random structure:

require(sjPlot)
plot_model(MWmod3c, type = "re", sort.est = "sort.all", grid = FALSE)[1]  #type='re' gives us random adjustments, while [1] indicates that we want only first plot
## [[1]]

plot_model(MWmod3c, type = "re", sort.est = "sort.all", grid = FALSE)[2]
## [[1]]

Visualisation of the fixed effects (interaction):

plot_model(MWmod3c, type = "int")

Significance of the random effects

ranova(MWmod3c)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## mwb ~ Total + male + (1 | LANAME) + (1 | Ethnicg:male) + Total:male
##                    npar  logLik    AIC    LRT Df Pr(>Chisq)    
## <none>                7 -358480 716974                         
## (1 | LANAME)          6 -358523 717059 86.708  1  < 2.2e-16 ***
## (1 | Ethnicg:male)    6 -358525 717061 88.990  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explained variance - R2

# install.packages('MuMIn')
require(MuMIn)
r.squaredGLMM(MWmod3c)  # m stands for marginal, while c stands for conditional. Marginal is approximation of the explained variance by fixed-effect structure, while conditional is with both fixed and random-effect structure
##             R2m        R2c
## [1,] 0.08319687 0.08936068

Predictions of the model

mwell2$predicted = predict(MWmod3c)  #prediction values from model to our train dataset
par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(mwell2$predicted, mwell2$mwb)  #plot predicted versus observed data points